home *** CD-ROM | disk | FTP | other *** search
- .pn1
- .po4
- A VERY SIMPLE CHART PARSER
-
- Peter Ross
- Dept. of AI, University of Edinburgh,
- 80 South Bridge,
- Edinburgh EH1 1HN,
- Scotland.
-
- This describes how to use the (very) simple chart parser written in Prolog
- in the file "chart.pro". You will need LOTS of stack space - the program
- does no assertion or retraction at run time, but keeps all the run-time
- data in lists.
-
- The program expects data in the form of Prolog clauses:
-
- [1] rule(Tag, Category, ExpansionList).
- The 'Tag' argument is just an arbitrary Prolog term used to group
- together a set of rule/3 and lexical/3 clauses. The parser needs
- to be told this tag in order to know which set of these clauses
- to use as data. Since it is arbitrary, you could use a compound
- term to allow you to selectively include rules, at no extra cost.
- The 'Category' argument identifies a syntactic category, e.g.
- noun or sentence.
- The 'ExpansionList' is a list giving one possible expansion of that
- category into sub-categories.
- Example:
- rule(simple, sentence, [noun_phrase, verb_phrase]).
- The categories (both Category and those in ExpansionList) can be
- compound terms, thus allowing the use of variables for result passing
- and context setting. The parser copies terms as needed, when
- unification might be a bad thing to do because it might cause
- supposedly independent data structures to become too specific.
- This copying, by the way, is done by dismantling the source structure
- and building a duplicate rather than by the crude assert/retract
- method; the latter tends to be much slower but takes less run-time
- space.
-
- NOTE: categories are not supposed to be defined recursively.
- Nevertheless, the code does check whether you are adding an edge you
- already have. The check does slow things a bit, but is specifically
- a check against left recursion at the time when empty active edges
- are added. A fuller check against adding duplicate inactive edges
- has been commented out; you probably don't need it unless you mess
- with the rule tags at run time so as to cause new parsing rules to
- be introduced during parsing.
-
- [2] lexical(Tag, Word, CategoryList).
- The 'Tag' argument is as above.
- The 'Word' is a word that may legitimately appear in the input.
- The 'CategoryList' is a list of the possible syntactic categories
- which that word might fit, e.g. 'fly' might be a noun or verb.
- As above, the categories can be arbitrary terms.
-
- [3] initial_category(Tag, Category).
- This states what the highest level category is - it should be unique.
- è
- [4] strategy(Tag, Strategy).
- This should instantiate Strategy to either 'td' for top-down or
- 'bu' for bottom-up. The parser does not check it for validity.
-
- [5] policy(Tag, Policy).
- This should instantiate Policy to either 'df' for depth-first or
- 'bf' for breadth-first. The parser does not check it for validity.
-
- [6] ersatz_category(Tag, DummyCategoryName).
- By default, the category name 'user' is specially used by the system.
- It supposes that there is an extra rule of the form
- rule(Tag, user, [TopLevelCategory])
- when doing top-down parsing. This extra category will of course appear
- in the final chart, and should help in the later analysis of it. If
- you are really attached to the category name 'user' but want to do
- some top-down parsing, then you can persuade the system to use a
- different name instead by including a clause of this form.
- Although this ersatz category can be useful, its original purpose
- was to make the program shorter at negligible speed cost by ensuring
- that the top-level category appeared on the left of a single rule.
-
- IMPORTANT: The rule/3 and lexical/3 predicates are only used during the
- initialisation stage, when the transformed rules are constructed
- and when the initial chart is created.
-
- There are three important data structures manipulated by the program:
-
- [a] edge(Category, FoundList, NeedsList, StartVertex, EndVertex).
- This describes an edge in the chart. If you don't know what this
- means, read Winograd's book "Language As A Cognitive Process, vol 1"
- or the articles by Henry Thompson and Graeme Ritchie in "Artificial
- Intelligence: Tools, Techniques and Applications" by O'Shea and
- Eisenstadt (published by Harper and Row).
- There are three odd details. Items on the 'FoundList' are of the form
- Category = VertexNumber
- (the starting, left-hand vertex of that instance of the category).
- A word from the input will appear in the 'FoundList' as
- word(Word) = VertexNumber
- The 'FoundList' is ordered so that the most recently found category is
- first.
-
- [b] the chart: ActiveEdgeList + InactiveEdgeList.
- The edges are segregated into two lists for convenience.
- The parser returns such a chart when it finishes.
-
- [c] the agenda: CandidateList - Hole.
- This is a difference list (a list with a variable at the end - 'Hole'
- is the variable). The items in the list are of the form
- ActiveEdge + InactiveEdge
- and are candidates (guaranteed successful, in fact) for an application
- of the fundamental rule of chart parsing.
- The default monitoring system prints out agendas.
-
- The program has two top level predicates. They are:
- è
- [i] parse(Tag, WordList, MaxVertex, Chart).
- The Tag and WordList must be given. The MaxVertex and Chart must be
- variables. When the parse is done, MaxVertex will be instantiated to
- the largest vertex number (the lowest is 0), and Chart will be
- instantiated to the final chart. When doing top-down parsing, the
- parser adds one ersatz rule of the form
- rule(Tag,user,[TopMostCategory])
- - there will be edges for the ersatz category 'user' to help you to
- examine the chart afterwards for successful parsings. You can
- substitute what you want for 'user' - see above. This predicate does
- some transformations on the rule/3 clauses supplied by the user. The
- transformations are all done by the predicate invert_rules(Tag).
-
- [ii] make_chart(Tag, WordList, MaxVertex, Chart).
- This is exactly like parse/4 above, but assumes that the rule
- transformations have already been done.
-
- When a parse has been done and a final chart has been produced, you may
- want to add to the word list and extend the chart, for example if using the
- parser for plan recognition. You can do so by the following means: call
- initial_setup(Tag, Strategy, Policy, NewWordList,
- MinVertex, NewMaxVertex,
- OldChart, NewInitialChart,
- AgendaOfYourChoice, NewAgenda),
- chart(Tag, Strategy, Policy, NewInitialChart, NewAgenda, FinalChart)
- This will assume that more words are added between MinVertex and NewMaxVertex,
- and will add to the OldChart and AgendaOfYourChoice you gave it, to create
- a new FinalChart. Since all the information needed about the presence of
- any previous set of words (presumably ending at MinVertex) will be contained
- in the OldChart you got out of parsing that previous word sequence, this will
- incrementally add to the chart.
-
- Normally parsing stops when the parser runs out of agenda. You may discover
- a need to suspend parsing under program control, for example if you want
- to doctor the chart or extract useful information from it before the entire
- input sequence is available. If so, you may redefine the test
- stop_parser(Tag,Strategy,Policy,Chart,Agenda,FinalChart)
- which currently does no more than instantiate the result FinalChart to
- Chart when Agenda is empty. It is the success of this predicate which halts
- parsing.
-
- The implementation of the fundamental rule is explicit within the code, so
- that it is easy to change for your own purposes.
-
-
- There is a simple monitoring mechanism. The predicates
- watch(Tag) nowatch(Tag)
- turn it on or off for the specified rule set. If on, it reports when the
- rule transformations are complete, and whenever an item on the agenda is
- processed, it tries to call the predicate
- user_mon(Tag,Strategy,Policy,OldChart,OldAgenda,NewChart,NewAgenda)
- If this fails, it uses a default strategy of printing out the new chart and
- agenda and waiting for a <cr> before continuing.
-
- èThere is a simple test rig, called by the predicate
- test(Tag).
- It makes use of a trivial utility
- print_chart(Chart)
- to print out the active and inactive edges after the parse has finished.
- The lists of edges are sorted before being printed, into the standard order
- of Prolog terms.
-
- A sample rule set can be found in the files "test01.pro" and "test02.pro".
-
-
- IMPLEMENTATION: SOME THOUGHTS
-
- This program clearly shows the need for some kind of user-definable indexing
- in Prolog in order to make such things efficient. Unfortunately record/3 is
- not much use, because the key can only usefully be atomic (only the principal
- functor name matters) and in C-Prolog the key cannot be an integer (nor is
- record/3 that fast in C-Prolog). At present all you've really got is the
- clause indexing mechanism, so you're stuck with assert/retract and all that
- that implies if you want to cut down on searching through big data structures
- such as long lists. Gimme arrays, gimme hunks, gimme a decent record package,
- gimme hash tables, gimme something!! Prolog-2 does let you specify hash tables
- for particular predicates, but these are only used by the calling mechanism.
- Wouldn't it be nice to have some way of specifying that anything with a given
- principal functor was to be accessed through a set of hash tables,
- sub-indexing specifiable by the user in some way, e.g. akin to that offered
- by PEARL for frame access? Some other Prologs (e.g. Arity Prolog) do now
- offer such things as B-trees and hash tables for term storage.
-
- At the moment, for simplicity, the chart is only represented as two lists -
- one of active edges, one of inactive edges. Since the chart only grows, never
- shrinks, it might be better to represent it as two ordered trees peppered with
- holes (variables). This ought to speed things up a lot.
-
- I considered the idea of not allowing Prolog variables in the terms which
- represent categories; this would end all temptation to misuse unification.
- Instead, I have in the past used terms of the form ?Atom as variables (where
- ?/1 is defined as a prefix operator, none of this messy atom exploding done
- by LISPers who don't/can't alter their character tables). This leads to
- passing binding lists about a lot. As far as the ersatz unification in edge
- generation goes, there is no significant speed difference between using
- Prolog variables and these pseudo-variables. However, there is a difference
- in speed when looking for candidate edge pairs, since with Prolog variables
- I can just use the dodge of specifying \+(\+(Category1=Category2)), which
- checks whether they will unify, without doing so, in a reasonably efficient
- way. With pseudo-variables I'd need to indulge in term smashing or suffer
- the overhead of carrying binding lists about in the chart. The latter is not
- so awful as it might look, since it would be possible to construct the unifier
- at the same time as checking candidacy, and just carry the unifier about till
- needed. But, this still looks a bit worse than using Prolog variables.
-
-